PM2.5 California Data for 2004
data04 <- read.csv("pm25ca2004.csv") # Load in data
dim(data04) # Check dimensions
## [1] 19233 20
names(data04) # Check variable names
## [1] "Date" "Source"
## [3] "Site.ID" "POC"
## [5] "Daily.Mean.PM2.5.Concentration" "UNITS"
## [7] "DAILY_AQI_VALUE" "Site.Name"
## [9] "DAILY_OBS_COUNT" "PERCENT_COMPLETE"
## [11] "AQS_PARAMETER_CODE" "AQS_PARAMETER_DESC"
## [13] "CBSA_CODE" "CBSA_NAME"
## [15] "STATE_CODE" "STATE"
## [17] "COUNTY_CODE" "COUNTY"
## [19] "SITE_LATITUDE" "SITE_LONGITUDE"
str(data04) # Check variable types
## 'data.frame': 19233 obs. of 20 variables:
## $ Date : chr "01/01/2004" "01/02/2004" "01/03/2004" "01/04/2004" ...
## $ Source : chr "AQS" "AQS" "AQS" "AQS" ...
## $ Site.ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
## $ POC : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Daily.Mean.PM2.5.Concentration: num 8.9 12.2 16.5 18.1 11.5 32.5 14 29.9 21 16.9 ...
## $ UNITS : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
## $ DAILY_AQI_VALUE : int 37 51 60 64 48 94 55 88 70 61 ...
## $ Site.Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
## $ DAILY_OBS_COUNT : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PERCENT_COMPLETE : num 100 100 100 100 100 100 100 100 100 100 ...
## $ AQS_PARAMETER_CODE : int 88101 88502 88502 88101 88502 88502 88101 88502 88502 88502 ...
## $ AQS_PARAMETER_DESC : chr "PM2.5 - Local Conditions" "Acceptable PM2.5 AQI & Speciation Mass" "Acceptable PM2.5 AQI & Speciation Mass" "PM2.5 - Local Conditions" ...
## $ CBSA_CODE : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
## $ CBSA_NAME : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
## $ STATE_CODE : int 6 6 6 6 6 6 6 6 6 6 ...
## $ STATE : chr "California" "California" "California" "California" ...
## $ COUNTY_CODE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
## $ SITE_LATITUDE : num 37.7 37.7 37.7 37.7 37.7 ...
## $ SITE_LONGITUDE : num -122 -122 -122 -122 -122 ...
PM2.5 California Data for 2019
data19 <- read.csv("pm25ca2019.csv")
dim(data19)
## [1] 53086 20
head(data19)
tail(data19)
names(data19)
## [1] "Date" "Source"
## [3] "Site.ID" "POC"
## [5] "Daily.Mean.PM2.5.Concentration" "UNITS"
## [7] "DAILY_AQI_VALUE" "Site.Name"
## [9] "DAILY_OBS_COUNT" "PERCENT_COMPLETE"
## [11] "AQS_PARAMETER_CODE" "AQS_PARAMETER_DESC"
## [13] "CBSA_CODE" "CBSA_NAME"
## [15] "STATE_CODE" "STATE"
## [17] "COUNTY_CODE" "COUNTY"
## [19] "SITE_LATITUDE" "SITE_LONGITUDE"
str(data19)
## 'data.frame': 53086 obs. of 20 variables:
## $ Date : chr "01/01/2019" "01/02/2019" "01/03/2019" "01/04/2019" ...
## $ Source : chr "AQS" "AQS" "AQS" "AQS" ...
## $ Site.ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
## $ POC : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Daily.Mean.PM2.5.Concentration: num 5.7 11.9 20.1 28.8 11.2 2.7 2.8 7 3.1 7.1 ...
## $ UNITS : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
## $ DAILY_AQI_VALUE : int 24 50 68 86 47 11 12 29 13 30 ...
## $ Site.Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
## $ DAILY_OBS_COUNT : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PERCENT_COMPLETE : num 100 100 100 100 100 100 100 100 100 100 ...
## $ AQS_PARAMETER_CODE : int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
## $ AQS_PARAMETER_DESC : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
## $ CBSA_CODE : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
## $ CBSA_NAME : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
## $ STATE_CODE : int 6 6 6 6 6 6 6 6 6 6 ...
## $ STATE : chr "California" "California" "California" "California" ...
## $ COUNTY_CODE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
## $ SITE_LATITUDE : num 37.7 37.7 37.7 37.7 37.7 ...
## $ SITE_LONGITUDE : num -122 -122 -122 -122 -122 ...
Checking The Key Variable from Each Data Set
summary(data04) # Summarize variables to gain insight
## Date Source Site.ID POC
## Length:19233 Length:19233 Min. :60010007 Min. : 1.000
## Class :character Class :character 1st Qu.:60370002 1st Qu.: 1.000
## Mode :character Mode :character Median :60658001 Median : 1.000
## Mean :60588026 Mean : 1.816
## 3rd Qu.:60750006 3rd Qu.: 2.000
## Max. :61131003 Max. :12.000
##
## Daily.Mean.PM2.5.Concentration UNITS DAILY_AQI_VALUE
## Min. : -0.10 Length:19233 Min. : 0.00
## 1st Qu.: 6.00 Class :character 1st Qu.: 25.00
## Median : 10.10 Mode :character Median : 42.00
## Mean : 13.13 Mean : 46.33
## 3rd Qu.: 16.30 3rd Qu.: 60.00
## Max. :251.00 Max. :301.00
##
## Site.Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
## Length:19233 Min. :1 Min. :100 Min. :88101
## Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
## Mode :character Median :1 Median :100 Median :88101
## Mean :1 Mean :100 Mean :88267
## 3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
## Max. :1 Max. :100 Max. :88502
##
## AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
## Length:19233 Min. :12540 Length:19233 Min. :6
## Class :character 1st Qu.:31080 Class :character 1st Qu.:6
## Mode :character Median :40140 Mode :character Median :6
## Mean :35328 Mean :6
## 3rd Qu.:41860 3rd Qu.:6
## Max. :49700 Max. :6
## NA's :1253
## STATE COUNTY_CODE COUNTY SITE_LATITUDE
## Length:19233 Min. : 1.00 Length:19233 Min. :32.63
## Class :character 1st Qu.: 37.00 Class :character 1st Qu.:34.07
## Mode :character Median : 65.00 Mode :character Median :36.48
## Mean : 58.63 Mean :36.23
## 3rd Qu.: 75.00 3rd Qu.:38.10
## Max. :113.00 Max. :41.71
##
## SITE_LONGITUDE
## Min. :-124.2
## 1st Qu.:-121.6
## Median :-119.3
## Mean :-119.7
## 3rd Qu.:-117.9
## Max. :-115.5
##
summary(data04$Daily.Mean.PM2.5.Concentration) # Summarize key variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.10 6.00 10.10 13.13 16.30 251.00
sum(is.na(data04)) # Check if data has any missing values
## [1] 1253
sum(is.na(data04$Daily.Mean.PM2.5.Concentration)) # Check key variable too
## [1] 0
summary(data19)
## Date Source Site.ID POC
## Length:53086 Length:53086 Min. :60010007 Min. : 1.000
## Class :character Class :character 1st Qu.:60310004 1st Qu.: 1.000
## Mode :character Mode :character Median :60612003 Median : 3.000
## Mean :60565291 Mean : 2.562
## 3rd Qu.:60771002 3rd Qu.: 3.000
## Max. :61131003 Max. :21.000
##
## Daily.Mean.PM2.5.Concentration UNITS DAILY_AQI_VALUE
## Min. : -2.200 Length:53086 Min. : 0.00
## 1st Qu.: 4.000 Class :character 1st Qu.: 17.00
## Median : 6.500 Mode :character Median : 27.00
## Mean : 7.733 Mean : 30.56
## 3rd Qu.: 9.900 3rd Qu.: 41.00
## Max. :120.900 Max. :185.00
##
## Site.Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
## Length:53086 Min. :1 Min. :100 Min. :88101
## Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
## Mode :character Median :1 Median :100 Median :88101
## Mean :1 Mean :100 Mean :88214
## 3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
## Max. :1 Max. :100 Max. :88502
##
## AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
## Length:53086 Min. :12540 Length:53086 Min. :6
## Class :character 1st Qu.:31080 Class :character 1st Qu.:6
## Mode :character Median :40140 Mode :character Median :6
## Mean :35841 Mean :6
## 3rd Qu.:41860 3rd Qu.:6
## Max. :49700 Max. :6
## NA's :4181
## STATE COUNTY_CODE COUNTY SITE_LATITUDE
## Length:53086 Min. : 1.00 Length:53086 Min. :32.58
## Class :character 1st Qu.: 31.00 Class :character 1st Qu.:34.14
## Mode :character Median : 61.00 Mode :character Median :36.63
## Mean : 56.39 Mean :36.35
## 3rd Qu.: 77.00 3rd Qu.:37.97
## Max. :113.00 Max. :41.76
##
## SITE_LONGITUDE
## Min. :-124.2
## 1st Qu.:-121.6
## Median :-119.8
## Mean :-119.8
## 3rd Qu.:-118.1
## Max. :-115.5
##
summary(data19$Daily.Mean.PM2.5.Concentration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.200 4.000 6.500 7.733 9.900 120.900
sum(is.na(data19))
## [1] 4181
sum(is.na(data19$Daily.Mean.PM2.5.Concentration))
## [1] 0
Based on the summary of our key variable Daily.Mean.PM2.5.Concentration, the minimum values in both data sets fall below zero, which is physically impossible for a particulate matter concentration. We will therefore run extra code that removes these rows from both data sets. There was no missing data for Daily.Mean.PM2.5.Concentration, so no removal of missing values is needed.
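Before filtering, the scale of the problem can be checked directly. The sketch below counts negative readings on a small synthetic frame (the column name matches the real data; the values are made up):

```r
# Toy stand-in for data04; values are illustrative only
toy <- data.frame(Daily.Mean.PM2.5.Concentration = c(8.9, -0.1, 12.2, -0.05, 16.5))

# How many rows would be dropped, and what share of the data that is
sum(toy$Daily.Mean.PM2.5.Concentration < 0)   # 2
mean(toy$Daily.Mean.PM2.5.Concentration < 0)  # 0.4
```

Applying the same two calls to data04 and data19 shows exactly how much data the negative-value filter removes.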
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ stringr 1.4.0
## ✓ tidyr 1.1.3 ✓ forcats 0.5.1
## ✓ readr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
data04 <- data04 %>%
filter(Daily.Mean.PM2.5.Concentration > 0)
summary(data04$Daily.Mean.PM2.5.Concentration) # Recheck minimum value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 6.00 10.10 13.14 16.30 251.00
setorder(data04, Daily.Mean.PM2.5.Concentration) # Order by PM conc.
head(data04) # Check headers after ordering
tail(data04) # Check footers after ordering
data19 <- data19 %>%
filter(Daily.Mean.PM2.5.Concentration > 0)
summary(data19$Daily.Mean.PM2.5.Concentration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.10 4.10 6.50 7.79 10.00 120.90
setorder(data19, Daily.Mean.PM2.5.Concentration)
head(data19)
tail(data19)
After filtering out negative concentration values and ordering each data set, head() and tail() confirmed that the negative values were removed and the data are ordered. The data can now be deemed ready for analysis; the next step combines our two frames.
Combining the Two Years of Data
dataMerged <- merge(x = data04, y = data19, all = TRUE) # Merge the two data sets
dataMerged$Date <- as.Date(dataMerged$Date, "%m/%d/%Y") # Reformat as Date class
dataMerged$Year <- as.numeric(format(dataMerged$Date, '%Y')) # Extract year
summary(dataMerged$Year) # Confirm new column
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2004 2004 2019 2015 2019 2019
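The date handling above can be illustrated in isolation; a minimal sketch with synthetic date strings in the same %m/%d/%Y format:

```r
# Parse character dates and pull out the year, as done for dataMerged
d <- as.Date(c("01/15/2004", "07/04/2019"), format = "%m/%d/%Y")
as.numeric(format(d, "%Y"))  # 2004 2019
```

A failed parse (for example, a string in the wrong format) yields NA rather than an error, so summary() on the derived Year column is a useful check that every date parsed.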
Changing Names of Variables
dataMerged <-
rename(dataMerged,
dailyConc = Daily.Mean.PM2.5.Concentration, # Rename columns to shorter names
lat = SITE_LATITUDE,
long = SITE_LONGITUDE,
siteName = Site.Name,
siteID = Site.ID)
Locations of Sites Using leaflet()
library(leaflet)
temp.pal <- colorFactor(c("#A3C41B", "#438A7D"), domain = dataMerged$Year) # Palette creation
leaflet(dataMerged) %>% # Chunk adapted from lecture and online examples of leaflet()
addProviderTiles('CartoDB.Positron') %>%
addCircles(
lat = ~lat, lng = ~long, color = ~temp.pal(Year),
opacity = 1, fillOpacity = 1, radius = 100) %>%
addLegend('bottomleft', pal=temp.pal, values=dataMerged$Year,
title='Year', opacity=1) # Legend shows sampling year
Based on the produced map, there are clearly more sites in 2019 than in 2004, consistent with the earlier dimension checks of the respective data sets. This could reflect the increasing importance of monitoring pollutant data, and the associated funding, over the fifteen-year gap. Additionally, the number of sites appears related to population density, with heavily populated areas like the San Francisco Bay Area, Greater Los Angeles, and San Diego having the most sites. This is sensible in several regards: pollution often follows people (transportation, polluting habits, etc.), so monitoring the places likely to suffer the most environmental damage is reasonable, and siting monitors where reporting is relevant to the most residents is strategic. Overall, urban areas are well represented in this combined data set.
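The visual impression of more sites in 2019 can be quantified by counting distinct site IDs per year. A sketch with dplyr on a toy frame shaped like dataMerged (toy is a stand-in; the real call would use dataMerged):

```r
library(dplyr)

# Toy stand-in with only the columns the count needs; values are made up
toy <- data.frame(
  Year   = c(2004, 2004, 2019, 2019, 2019),
  siteID = c(60010007, 60010007, 60010007, 60371103, 60731002)
)

toy %>%
  group_by(Year) %>%
  summarise(nSites = n_distinct(siteID))
# Year 2004 has 1 distinct site, 2019 has 3
```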
Checking for Missing Values of PM2.5 in Combined Data Set
sum(is.na(dataMerged$dailyConc)) # Count if there are missing values
## [1] 0
mean(is.na(dataMerged$dailyConc)) # Proportion of missing values, if any
## [1] 0
It appears there are no missing values of PM2.5 in the combined data set, consistent with the checks of the individual data sets earlier in the data preparation stage.
Checking for Implausible Values of PM2.5 in Combined Data Set
setorder(dataMerged, -dailyConc) # Arrange by decreasing order to view high anomalies
dataMerged %>%
select(Date, dailyConc, DAILY_AQI_VALUE, siteName) # Show specific data wanted to view
Extremely small concentrations of PM2.5 are physically possible, so low values are plausible for any given pollution data set. Efforts were therefore focused on high concentrations to ensure they were not implausible or outside expected levels. Ordering daily PM2.5 concentrations in decreasing order places the largest recordings at the top. The two highest measurements (251.0 ug/m3 LC and 170.4 ug/m3 LC) occurred in Yosemite during the summer of 2004. The AQI was similarly high for the same measurements (301 and 221, respectively), which gives more evidence that the numbers may be valid. External internet searches revealed that wildfires are frequent in the area, and per the EPA, AQI values of 301-500 correspond to hazardous conditions typical of wildfire smoke, as do PM2.5 concentrations above roughly 250 ug/m3. For this reason, these two instances seem unlikely to be implausible.
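Rather than scanning the ordered frame by eye, a simple threshold check can flag candidate outliers for manual review. A sketch, where the 250 ug/m3 cutoff is an assumption borrowed from the EPA hazardous-range discussion above and the toy values are illustrative:

```r
# Toy stand-in for dataMerged's concentration column
toy <- data.frame(dailyConc = c(251.0, 170.4, 10.1, 6.5))

# Flag, rather than silently drop, readings in the hazardous range
toy$flagHigh <- toy$dailyConc > 250
subset(toy, flagHigh)  # one row: 251.0
```

Flagging keeps the extreme rows in the data so they can be cross-checked against AQI and location before any decision to exclude them.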
Exploratory Plots and Summary Statistics by State
library(ggplot2)
# Histogram
ggplot(dataMerged, aes(x = dailyConc)) +
geom_histogram(bins = 150, fill = "gray", color = "navy") +
facet_wrap(. ~ Year) +
xlim(0,100)
## Warning: Removed 10 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
# Boxplot
ggplot(dataMerged, aes(x = dailyConc)) +
geom_boxplot(color = "navy") +
facet_wrap(. ~ Year)
# Line plot
ggplot(dataMerged, aes(y = dailyConc, x = Date)) +
geom_line(color = "navy") +
facet_wrap(. ~ Year, scales = "free")
# Summarize
dataMerged %>%
group_by(Year) %>%
summarise(mean = mean(dailyConc), # Various summary stats
median = median(dailyConc),
sd = sd(dailyConc),
min = min(dailyConc),
max = max(dailyConc),
IQR = IQR(dailyConc))
Because this merged data set holds only California data (as specified at download), no filtering or sorting by a geographic variable was needed, so only a limited number of figures were produced at this spatial level. Regardless, there are clear takeaways when comparing California in 2004 and 2019:
California reported substantially more PM2.5 measurements in 2019 than in 2004, as indicated by the much greater counts in the 2019 histogram panel
The histogram depicts a positive skew of PM2.5 concentration in both years, which is preferable to a negative skew for health reasons, since most days sit at the low end of the concentration range
The boxplot demonstrates that 2004 had a larger IQR, median, and maximum value of daily PM2.5 concentration
The line plot exhibits a relatively stable year of PM2.5 concentration for 2004, while 2019 shows occasional days of relatively high PM2.5
According to the summary statistics, 2004 reported a higher average PM2.5 concentration than 2019 (13.14 ug/m3 LC vs. 7.79 ug/m3 LC, respectively), as well as a higher IQR, median, standard deviation, and maximum measurement, the last reflecting the Yosemite wildfires discussed previously
Exploratory Plots and Summary Statistics by County
# Plot to compare counties among each other and between 2004 and 2019
ggplot(dataMerged) +
geom_point(mapping = aes(x = COUNTY, y = dailyConc, colour = factor(Year))) +
scale_color_manual(values=c("#A3C41B", "#438A7D")) +
labs(x = "County", y = "PM2.5 Concentration (ug/m^3 LC)") +
theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))
# Summary
dataMerged %>%
group_by(COUNTY, Year) %>% # COUNTY before year to see same county data side-by-side
summarise(mean = mean(dailyConc),
median = median(dailyConc),
sd = sd(dailyConc),
min = min(dailyConc),
max = max(dailyConc),
IQR = IQR(dailyConc))
## `summarise()` has grouped output by 'COUNTY'. You can override using the `.groups` argument.
Based on the output when stratifying by county, the majority of counties experienced an overall decrease in PM2.5 concentration between 2004 and 2019. Some, like San Joaquin and Tehama counties, appear to be anomalies; these areas may have experienced an actual increase in pollutants, or may not have been reporting in 2004 to begin with. The stratification also allows comparison across counties: the figure shows that Mariposa County experienced dramatically high PM2.5 concentrations in 2004 relative to other counties, while Lake County recorded the lowest PM2.5 concentrations among all CA counties.
The statistical summary confirms that the majority of counties reduced their average PM2.5 concentration and shows that many also reduced their respective maximum measurements, IQRs, medians, and standard deviations. Overall, the reduction in pollutant concentration is evident.
Exploratory Plots and Summary Statistics by site in Los Angeles
# Plot to compare sites within Los Angeles County between 2004 and 2019
dataLA <-
dataMerged %>%
filter(COUNTY == "Los Angeles") # Take out other counties' data
ggplot(dataLA) +
geom_point(mapping = aes(x = siteName, y = dailyConc, colour = factor(Year))) +
scale_color_manual(values=c("#437B8A", "#F8C228")) +
labs(x = "LA Site Name", y = "PM2.5 Concentration (ug/m^3 LC)") +
theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))
# Summary
dataLA %>%
group_by(siteName, Year) %>%
summarise(mean = mean(dailyConc),
median = median(dailyConc),
sd = sd(dailyConc),
min = min(dailyConc),
max = max(dailyConc),
IQR = IQR(dailyConc))
## `summarise()` has grouped output by 'siteName'. You can override using the `.groups` argument.
Based on the LA-specific figure, it is clear that some sites did not exist in 2004, with only eight sites operating in both of the investigated years. All eight experienced decreases in PM2.5 concentration, and the remaining sites all had relatively low measurements in 2019 regardless.
Much like the stratified county data, the summary statistics quantify how much each site changed over the fifteen-year span. Excluding the maximum value at the Reseda site, every statistic decreased from 2004 to 2019 for sites with data in both years. Overall, it appears Los Angeles County in particular has also experienced a decrease in particulate matter concentrations.